Implementation and Evaluation of a Language Identification System for Mono- and Multi-lingual Texts

نویسندگان

  • Olga Artemenko
  • Thomas Mandl
  • Margaryta Shramko
  • Christa Womser-Hacker
چکیده

Language identification is a classification task between a pre-defined model and a text in an unknown language. This paper presents the implementation of a tool for language identification for mono-and multilingual documents. The tool includes four algorithms for language identification. An evaluation for eight languages including Ukrainian and Russian and various text lengths is presented. It could be shown that n-gram-based approaches outperform word-based algorithms for short texts. For longer texts, the performance is comparable. The tool can also identify language changes within one multilingual document .

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

A Comparison in Reading Ability and Achievement between Mono-Lingual and Bilingual Fifth Graders

A Comparison in Reading Ability and Achievement between Mono-Lingual and Bilingual Fifth Graders Y. Adib, Ph.D. Z. Sharifi N. Mahmoodi To  compare both the reading ability and academic achievement among Farsi speaking mono-lingual fifth graders and their bilingual Aazari and Kordi counterparts, three samples of 153, 132, and 145 (total 430) such students from three cities o...

متن کامل

Design and implementation of Persian spelling detection and correction system based on Semantic

Persian Language has a special feature (grapheme, homophone, and multi-shape clinging characters) in electronic devices. Furthermore, design and implementation of NLP tools for Persian are more challenging than other languages (e.g. English or German). Spelling tools are used widely for editing user texts like emails and text in editors.  Also developing Persian tools will provide Persian progr...

متن کامل

Neural Relation Extraction with Multi-lingual Attention

Relation extraction has been widely used for finding unknown relational facts from the plain text. Most existing methods focus on exploiting mono-lingual data for relation extraction, ignoring massive information from the texts in various languages. To address this issue, we introduce a multi-lingual neural relation extraction framework, which employs monolingual attention to utilize the inform...

متن کامل

Robust Cross-Lingual Genre Classification through Comparable Corpora

Classification of texts by genre can benefit applications in Natural Language Processing and Information Retrieval. However, a mono-lingual approach requires large amounts of labeled texts in the target language. Work reported here shows that the benefits of genre classification can be extended to other languages through cross-lingual methods. Comparable corpora – here taken to be collections o...

متن کامل

The 5th Workshop on Building and Using Comparable Corpora

Classification of texts by genre can benefit applications in Natural Language Processing and Information Retrieval. However, a mono-lingual approach requires large amounts of labeled texts in the target language. Work reported here shows that the benefits of genre classification can be extended to other languages through cross-lingual methods. Comparable corpora – here taken to be collections o...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2005